1. Field of the Invention
The present invention relates to a method and apparatus for automatic document recognition and, more particularly, to a method for automatically determining the language of the characters of the document.
2. Description of Related Art
Optical character recognition and the use of optical character recognition to convert scanned image data into text data suitable for use in a digital computer are well known. In addition, methods for converting scanned image data into text data and the types of errors such methods generate are well known. However, the proper selection of the character recognition method is highly dependent upon the script (alphabet or ideogram) used in the document. Further, the selection of a proper method for error correction is highly dependent upon the language of the document. Conventionally, the methods for optical character recognition and for error correction in optical character recognition systems have been provided on the assumption that the script and language used in the document are the usual script and language of the country in which the system is being used. That is, in the United States, a conventional optical character recognition system would assume that the document is in English and uses Roman script, while in Japan, an optical character recognition system would be implemented assuming that the language is Japanese and that the document uses the Japanese scripts. Alternatively, an optical character recognition system can be implemented with the character recognition and error resolution methods for a plurality of languages.
However, it has heretofore not been possible to have the optical character recognition system automatically determine the script type and/or language of the document. Rather, as each document is provided to the optical character recognition system, some indication of the particular language and script of the document must be given to the optical character recognition system. This has been accomplished either by having the operator input data concerning the language and script of the document to the optical character recognition system, or by having the document provided with special markings which indicate the language and script of the document.